Is there a relationship between whether a loan is good or bad and the time between when an account is opened and when the loan is created? Is there a specific set of accounts that seems to be at higher or lower risk of defaulting?

R solution

library(ggplot2)
library(plyr)
library(dplyr)
library(plotly)
#read data
accounts  = read.csv('data/accounts_analytical.csv', header = TRUE)
#data pre-processing
loans = accounts[!is.na(accounts$loan_amount), c('account_id', "acct_creation_date", "loan_date", "loan_amount", "loan_payment", "loan_term", "loan_status", "loan_default")]

# Parse the dates; %m is the month number (%M would be read as minutes)
loans$acct_creation_date <- as.Date(as.character(loans$acct_creation_date), format = "%Y-%m-%d")
loans$loan_date <- as.Date(as.character(loans$loan_date), format = "%Y-%m-%d")
loans$range =  loans$loan_date - loans$acct_creation_date
temp_default = loans[,-c(1:3)]
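# Go through a factor so the integer coding also works when loan_status is read as character
temp_default$loan_status = as.numeric(as.factor(temp_default$loan_status))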
temp_default$range= as.numeric(temp_default$range)
temp_default$loan_default= as.numeric(temp_default$loan_default)


M1= cor(as.matrix(temp_default))
range1 = M1[,'range']

fig <- plot_ly(x = names(range1), y = range1, type = 'bar')
fig <- fig %>% layout(title = "Correlation between time and other loan factors", yaxis = list(title = 'Correlation coefficient'), barmode = 'group')

fig

From the comparison above, the correlation between loan default status and the time between account opening and loan creation is very low. The graphs below show that the distribution of this time gap is very similar for defaulted and non-defaulted loans. Therefore, with the data we have, we cannot identify a specific set of accounts that is at higher or lower risk of defaulting.
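
One way to put a number on the "very low correlation" reading is to test it directly. The sketch below is not part of the original solution: it runs on simulated data (the data frame `sim_loans` is hypothetical and only mirrors the `range` and `loan_default` columns), and the choice of `cor.test` plus a Wilcoxon rank-sum test is an assumption about how the claim could be checked.

```r
# Hypothetical stand-in for `loans`: simulated values, NOT the real data.
set.seed(42)
n <- 500
sim_loans <- data.frame(
  range        = round(runif(n, min = 100, max = 700)),  # days from account opening to loan
  loan_default = rbinom(n, size = 1, prob = 0.1) == 1     # TRUE = defaulted
)

# Point-biserial correlation: Pearson correlation against a 0/1 variable.
ct <- cor.test(sim_loans$range, as.numeric(sim_loans$loan_default))

# Rank-based comparison of the two time distributions (no normality assumption).
wt <- wilcox.test(range ~ loan_default, data = sim_loans)

# Median time gap per group, to read alongside the density plot.
group_summary <- aggregate(range ~ loan_default, data = sim_loans, FUN = median)

print(ct$estimate)
print(ct$p.value)
print(wt$p.value)
print(group_summary)
```

On the real data, the same calls would take `loans$range` and `loans$loan_default` as inputs. A correlation near zero together with a large p-value from both tests would be consistent with the conclusion above, although it would not prove that no relationship exists.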

loans$range = as.numeric(loans$range)
scatterPlot = ggplot(loans, aes(x=range, y=loan_amount, color=loan_default)) +
  geom_point(alpha = 0.5)+
  labs(title="Time range vs. loan_amount by default",x="time (days)", y = "loan_amount" )
scatterPlot

xdensity <- ggplot(loans, aes(x =range, fill=loan_default, color=loan_default)) + 
  geom_density(alpha=.5) + 
  labs(title="Time range density plot by default",x="time (days)", y = "density" )
xdensity

# Density plot of loan_amount by default status
ydensity <- ggplot(loans, aes(loan_amount, fill=loan_default, color=loan_default)) + 
  geom_density(alpha=.5) + 
  labs(title="loan_amount density plot by default",x="loan_amount", y = "density" )
ydensity
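
To look at the second question more directly (whether some set of accounts is at higher or lower risk), one option is to bucket the time gap and compare default rates per bucket. This is a minimal sketch under stated assumptions, not the original analysis: `sim_loans` is simulated, and the quartile buckets and the chi-squared test are illustrative choices.

```r
# Hypothetical stand-in for `loans`: simulated values, NOT the real data.
set.seed(1)
n <- 600
sim_loans <- data.frame(
  range        = round(runif(n, min = 100, max = 700)),  # days from account opening to loan
  loan_default = rbinom(n, size = 1, prob = 0.1) == 1     # TRUE = defaulted
)

# Cut the time gap into quartile buckets.
breaks <- quantile(sim_loans$range, probs = seq(0, 1, by = 0.25))
sim_loans$range_bucket <- cut(sim_loans$range, breaks = breaks, include.lowest = TRUE)

# Default rate and number of loans in each bucket.
rates  <- tapply(sim_loans$loan_default, sim_loans$range_bucket, mean)
counts <- table(sim_loans$range_bucket)
bucket_rates <- data.frame(
  bucket       = names(rates),
  default_rate = as.numeric(rates),
  n_loans      = as.integer(counts)
)

# Chi-squared test of independence between bucket and default status.
bucket_test <- chisq.test(table(sim_loans$range_bucket, sim_loans$loan_default))

print(bucket_rates)
print(bucket_test$p.value)
```

If the per-bucket default rates on the real data were close to each other and the test p-value were large, that would support the reading that the time gap alone does not single out a higher-risk or lower-risk group of accounts.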